Abstract: The Lenstra-Lenstra-Lovasz (LLL) algorithm is the most practical lattice reduction algorithm in digital communications. In this paper, several variants of the LLL algorithm with either lower theoretic complexity or fixed-complexity implementation are proposed and/or analyzed. Firstly, the $O(n^4 \log n)$ theoretic average complexity of the standard LLL algorithm under the model of i.i.d. complex normal distribution is derived. Then, the use of effective LLL reduction for lattice decoding is presented, where size reduction is only performed for pairs of consecutive basis vectors. Its average complexity is shown to be $O(n^3 \log n)$, which is an order lower than previously thought. To address the issue of variable complexity of standard LLL, two fixed-complexity approximations of LLL are proposed. One is fixed-complexity effective LLL, while the other is fixed-complexity LLL with deep insertion, which is closely related to the well known V-BLAST algorithm. Such fixed-complexity structures are highly desirable in hardware implementation since they allow straightforward constant-throughput implementation.



I. Introduction 

Lattice pre/decoding for the linear multi-input multi- 
output (MIMO) channel is a problem of high relevance 
in single/multi-antenna, broadcast, cooperative and other 
multi-terminal communication systems [1]-[4]. Maximum-likelihood
(ML) decoding for a rectangular finite subset of
a lattice can be realized efficiently by sphere decoding [2], 
[5], whose complexity can nonetheless grow prohibitively with 
dimension n [6]. The decoding complexity is especially felt 
in coded systems, where the lattice dimension is larger [7]. 
Thus, we often have to resort to approximate solutions, which 
mostly fall into two main streams. One of them is to reduce the 
complexity of sphere decoding, notably by relaxation or prun- 
ing; the former applies lattice reduction pre-processing and 
searches the infinite lattice, while the latter only searches part
of the tree by pruning some branches. Another stream is
lattice-reduction-aided decoding [10], [11], first proposed by
Babai in [12], which in essence applies zero-forcing (ZF) or
successive interference cancelation (SIC) on a reduced lattice.
It is known that Lenstra, Lenstra and Lovasz (LLL) reduction 
combined with ZF or SIC achieves full diversity in MIMO 
fading channels [13], [14] and that lattice-reduction-aided 
decoding has constant gap to (infinite) lattice decoding [15]. 
It was further shown in [16] that minimum mean square error 
(MMSE)-based lattice-reduction aided decoding can achieve 
the optimal diversity and multiplexing tradeoff. In [17], it 
was shown that Babai's decoding using MMSE provides near- 
ML performance for small-size MIMO systems. More recent 
research further narrowed down the gap to ML decoding by 
means of sampling [18] and embedding [19]. 

As can be seen, lattice reduction plays a crucial role in 
MIMO decoding. The celebrated LLL algorithm [20] features 
polynomial complexity with respect to the dimension for any 
given lattice basis but may not be strong enough for some 
applications. In cryptanalysis, where the dimension
of the lattice can be quite high, block Korkin-Zolotarev (KZ)
reduction is popular. Meanwhile, LLL with deep insertions
(LLL-deep) is a variant of LLL that extends swapping in
standard LLL to nonconsecutive vectors, thus finding shorter
lattice vectors than LLL [21]. Recently, it was found that
LLL-deep might be more promising than block-KZ reduction in
high dimensions [22], since it runs faster than the latter.

Lattices in digital communications are complex-valued in 
nature. Since the original LLL algorithm was proposed for 
real-valued lattices [20], a standard approach to dealing with 
complex lattices is to convert them into real lattices. Although 
the real LLL algorithm is well understood, this approach 
doubles the dimension and incurs more computations. There 
have been several attempts to extend LLL reduction to com- 
plex lattices [23]-[26]. However, complex LLL reduction is 
less understood. While our recent work has shown that the 
complex LLL algorithm lowers the computational complexity 
by roughly 50% [26], a rigorous complexity analysis is yet 
to be developed. In this paper, we analyze the complexity 
of complex LLL and propose variants of the LLL algorithm 
with either lower theoretic complexity or a fixed-complexity 
implementation structure. 

More precisely, we shall derive the theoretic average complexity $O(n^4 \log n)$ of complex LLL, assuming that the entries of $\mathbf{B}$ are i.i.d. standard normal, which is the typical MIMO channel model. For integral bases, it is well known that the LLL algorithm has complexity bound $O(n^4 \log B)$, where $B$ is the maximum length of the column vectors of basis matrix $\mathbf{B}$ [20]. The complexity of the LLL algorithm for real or complex-valued bases is less known. To the best of our knowledge, [27] was the only work prior to ours [28] on the complexity analysis of the real-valued LLL algorithm for a probabilistic model. However, [27] assumed basis vectors drawn independently from the unit ball of $\mathbb{R}^n$, which does not hold in MIMO communications.

Then, we propose the use of a variant of the LLL algorithm, namely effective LLL reduction, in lattice decoding. The term effective LLL reduction was coined in [29], and the technique was proposed independently by a number of researchers including [30]. We will show the average complexity bound $O(n^3 \log n)$ in MIMO, i.e., an order lower than that of the standard LLL algorithm. This is because a weaker version of LLL reduction without full size reduction is often sufficient for lattice decoding. Besides, it can easily be transformed to a standard LLL-reduced basis while retaining the $O(n^3 \log n)$ bound.

A drawback of the traditional LLL algorithm in digital
communications is its variable complexity. The worst-case
complexity could be quite large (see [31] for a discussion
of the worst-case complexity), and could limit the speed of
decoding hardware. To overcome this drawback, we propose
two fixed-complexity approximations of the LLL algorithm, 
which are based on the truncation of the parallel structures 
of the effective LLL and LLL-deep algorithms, respectively. 
When implemented in parallel, the proposed fixed-complexity 
algorithms allow for higher reduction speed than the sequential 
LLL algorithm. 

In the study of fixed-complexity LLL, we discover an 
interesting relation between LLL and the celebrated vertical 
Bell-labs space-time (V-BLAST) algorithm [32]. V-BLAST 
is a technique commonly used in digital communications 
that sorts the columns of a matrix for the purpose of better 
detection performance. It is well known that V-BLAST sorting 
does not improve the diversity order in multi-input multi- 
output (MIMO) fading channels; therefore it is not thought 
to be powerful enough. In this paper, we will show that V- 
BLAST and LLL are in fact closely related. More precisely, 
we will show that if a basis is both sorted in a sense closely 
related to V-BLAST and size-reduced, then it is reduced in 
the sense of LLL-deep. 

Relation to prior work: The average complexity of real- 
valued effective LLL in MIMO decoding was analyzed by 
the authors in [28]. Jalden et al. [31] gave a similar analysis 
of complex-valued LLL using a different method, yet the 
bound in [31] is less tight. A fixed-complexity LLL algorithm 
was given in [33], while the fixed-complexity LLL-deep was 
proposed by the authors in [34]. The fixed-complexity LLL- 
deep will also help demystify the excellent performance of 
the so-called DOLLAR (double-sorted low-complexity lattice- 
reduced) detector in [35], which consists of a sorted QR 
decomposition, a size reduction, and a V-BLAST sorting. Both 
its strength and weakness will be revealed in this paper.

The rest of the paper is organized as follows. Section II gives 
a brief review of LLL and LLL-deep. In Section III, we present 
the complexity analysis of complex LLL in MIMO decoding.
Effective LLL and its complexity analysis are given in Section
IV. Section V is devoted to fixed-complexity structures of LLL. 
Concluding remarks are given in Section VI. 

Notation: Matrices and column vectors are denoted by upper and lowercase boldface letters (unless otherwise stated), and the transpose, Hermitian transpose, and inverse of a matrix $\mathbf{B}$ by $\mathbf{B}^T$, $\mathbf{B}^H$, and $\mathbf{B}^{-1}$, respectively. The inner product in the complex Euclidean space between vectors $\mathbf{u}$ and $\mathbf{v}$ is defined as $\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^H \mathbf{v}$, and the Euclidean length is $\|\mathbf{u}\| = \sqrt{\langle \mathbf{u}, \mathbf{u} \rangle}$. $\Re(x)$ and $\Im(x)$ denote the real and imaginary parts of $x$, respectively. $\lceil x \rfloor$ denotes rounding to the integer closest to $x$. If $x$ is a complex number, $\lceil x \rfloor$ rounds the real and imaginary parts separately. The big O notation $f(x) = O(g(x))$ means that for sufficiently large $x$, $f(x)$ is bounded by a constant times $g(x)$ in absolute value.

II. The LLL Algorithm 

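Consider lattices in the complex Euclidean space $\mathbb{C}^n$. A
complex lattice is defined as the set of points $L = \{\mathbf{B}\mathbf{x} \mid \mathbf{x} \in \mathcal{G}^n\}$,
where $\mathbf{B} \in \mathbb{C}^{n \times n}$ is referred to as the basis matrix,
and $\mathcal{G} = \mathbb{Z} + j\mathbb{Z}$, $j = \sqrt{-1}$, is the set of Gaussian integers.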
For convenience, we only consider a square matrix B in this 
paper, while the extension to a tall matrix is straightforward. 
Aside from the interests in digital communications [25], [26] 
and coding [36], complex lattices have found applications in 
factoring polynomials over Gaussian integers [24]. 

A lattice L can be generated by infinitely many bases, from 
which one would like to select one that is in some sense nice 
or reduced. In many applications, it is advantageous to have 
the basis vectors as short as possible. The LLL algorithm is 
a polynomial time algorithm that finds short vectors within 
an exponential approximation factor [20]. The complex LLL 
algorithm, which is a modification of its real counterpart, 
has been described in [23]-[26]. The complex LLL algorithm 
works directly with complex basis B rather than converting it 
into the real equivalent. Although the cost for each complex 
arithmetic operation is higher than its real counterpart, the total 
number of operations required for complex LLL reduction is 
approximately half of that for real LLL [26]. 

A. Gram-Schmidt (GS) orthogonalization (O), QR and 
Cholesky decompositions 

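For a matrix $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_n] \in \mathbb{C}^{n \times n}$, the classic GSO is defined as follows [37]:

$$\hat{\mathbf{b}}_i = \mathbf{b}_i - \sum_{j=1}^{i-1} \mu_{i,j} \hat{\mathbf{b}}_j, \quad \text{for } i = 1, \ldots, n \tag{1}$$

where $\mu_{i,j} = \langle \mathbf{b}_i, \hat{\mathbf{b}}_j \rangle / \|\hat{\mathbf{b}}_j\|^2$. In matrix notation, it can be written as $\mathbf{B} = \hat{\mathbf{B}} \boldsymbol{\mu}^T$, where $\hat{\mathbf{B}} = [\hat{\mathbf{b}}_1, \ldots, \hat{\mathbf{b}}_n]$, and $\boldsymbol{\mu} = [\mu_{i,j}]$ is a lower-triangular matrix with unit diagonal elements.

GSO is closely related to QR decomposition $\mathbf{B} = \mathbf{Q}\mathbf{R}$, where $\mathbf{Q}$ is an orthonormal matrix and $\mathbf{R}$ is an upper-triangular matrix with nonnegative diagonal elements. More precisely, one has the relations $\mu_{j,i} = r_{i,j}/r_{i,i}$ and $\hat{\mathbf{b}}_i = r_{i,i} \mathbf{q}_i$, where $\mathbf{q}_i$ is the $i$-th column of $\mathbf{Q}$. QR decomposition can be implemented in various ways such as GSO, Householder and Givens transformations [37].

The Cholesky decomposition $\mathbf{A} = \mathbf{R}^H \mathbf{R}$ computes the R factor of the QR decomposition from the Gram matrix $\mathbf{A} = \mathbf{B}^H \mathbf{B}$. Given the Gram matrix, the computational complexity of Cholesky decomposition is approximately $n^3/3$, which is lower than $2n^3$ of QR decomposition [37].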
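The GSO in (1) can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names `inner` and `gso` are hypothetical, and the coefficient is computed as the projection of $\mathbf{b}_i$ onto $\hat{\mathbf{b}}_j$ (conjugating the GS vector), which is an assumption about the inner-product ordering chosen so that $\mathbf{b}_i = \hat{\mathbf{b}}_i + \sum_{j<i} \mu_{i,j} \hat{\mathbf{b}}_j$ holds for complex vectors.

```python
def inner(u, v):
    """Inner product u^H v (conjugate taken on the first argument)."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def gso(basis):
    """Classic Gram-Schmidt orthogonalization of complex column vectors.

    basis: list of n vectors, each a list of (complex) numbers.
    Returns (gs, mu) such that b_i = gs_i + sum_{j<i} mu[i][j] * gs_j.
    Assumes the vectors are linearly independent.
    """
    n = len(basis)
    gs = []
    mu = [[0j] * n for _ in range(n)]
    for i, b in enumerate(basis):
        v = [complex(x) for x in b]
        for j in range(i):
            # Projection coefficient of b_i onto the j-th GS vector.
            mu[i][j] = inner(gs[j], b) / inner(gs[j], gs[j]).real
            v = [x - mu[i][j] * y for x, y in zip(v, gs[j])]
        mu[i][i] = 1 + 0j  # unit diagonal
        gs.append(v)
    return gs, mu


basis = [[1, 0], [1 + 1j, 1]]
gs, mu = gso(basis)
```

The returned `gs` vectors are mutually orthogonal, and `mu` is lower-triangular with unit diagonal, matching the matrix form $\mathbf{B} = \hat{\mathbf{B}} \boldsymbol{\mu}^T$.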

B. LLL Reduction 

Definition 1 (Complex LLL): Let $\mathbf{B} = \hat{\mathbf{B}} \boldsymbol{\mu}^T$ be the GSO of a complex-valued basis $\mathbf{B}$. $\mathbf{B}$ is LLL-reduced if both of the following conditions are satisfied:

$$|\Re(\mu_{i,j})| \le 1/2 \quad \text{and} \quad |\Im(\mu_{i,j})| \le 1/2 \tag{2}$$

for $1 \le j < i \le n$, and

$$\|\hat{\mathbf{b}}_i\|^2 \ge (\delta - |\mu_{i,i-1}|^2) \|\hat{\mathbf{b}}_{i-1}\|^2 \tag{3}$$

for $1 < i \le n$, where $1/2 < \delta \le 1$ is a factor selected to achieve a good quality-complexity tradeoff.

The first condition is the size-reduced condition, while the second is known as the Lovasz condition. It follows from the Lovasz condition (3) that for an LLL-reduced basis

$$\|\hat{\mathbf{b}}_i\|^2 \ge (\delta - 1/2) \|\hat{\mathbf{b}}_{i-1}\|^2, \tag{4}$$

i.e., the lengths of GS vectors do not drop too much.
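The step from the Lovasz condition (3) to the bound (4) can be spelled out as follows. This is a short worked derivation that uses only the size-reduced condition (2) for $j = i - 1$:

```latex
|\mu_{i,i-1}|^2 = \Re(\mu_{i,i-1})^2 + \Im(\mu_{i,i-1})^2
               \le \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2},
\qquad\text{hence}\qquad
\|\hat{\mathbf{b}}_i\|^2
  \ge \bigl(\delta - |\mu_{i,i-1}|^2\bigr)\,\|\hat{\mathbf{b}}_{i-1}\|^2
  \ge \bigl(\delta - \tfrac{1}{2}\bigr)\,\|\hat{\mathbf{b}}_{i-1}\|^2 .
```

For a real-valued basis the imaginary part vanishes, so the same argument yields the factor $\delta - 1/4$ instead of $\delta - 1/2$.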

Let $\alpha = 1/(\delta - 1/2)$. A complex LLL-reduced basis satisfies [23], [24]:

$$\begin{aligned}
\|\mathbf{b}_1\| &\le \alpha^{(n-1)/4} \det{}^{1/n} L, \\
\|\mathbf{b}_1\| &\le \alpha^{(n-1)/2} \lambda_1, \\
\prod_{i=1}^{n} \|\mathbf{b}_i\| &\le \alpha^{n(n-1)/4} \det L,
\end{aligned} \tag{5}$$

where $\lambda_1$ is the length of the shortest vector in $L$, and $\det L = |\det \mathbf{B}|$. These properties show in various senses that the vectors of a complex LLL-reduced basis are not too long. Analogous properties hold for real LLL, with $\alpha$ replaced with $\beta = 1/(\delta - 1/4)$ [20]. It is noteworthy that although the bounds (5) for complex LLL are in general weaker than the real-valued counterparts, the actual performances of complex and real LLL algorithms are very close, especially when $n$ is not too large.

A size-reduced basis can be obtained by reducing each vector individually. The vector $\mathbf{b}_k$ is size-reduced if $|\Re(\mu_{k,l})| \le 1/2$ and $|\Im(\mu_{k,l})| \le 1/2$ for all $l < k$. Algorithm 1 shows how $\mathbf{b}_k$ is size-reduced against $\mathbf{b}_l$ ($l < k$). To size-reduce $\mathbf{b}_k$, we call Algorithm 1 for $l = k - 1$ down to $1$. Size-reducing $\mathbf{b}_k$ does not affect the size reduction of the other vectors. Furthermore, it is not difficult to see that size reduction does not change the GS vectors.

Algorithm 2 describes the LLL algorithm (see [26] for the pseudo-code of complex LLL). It computes a reduced basis by performing size reduction and swapping in an iterative manner. If the Lovasz condition (3) is violated, the basis vectors $\mathbf{b}_k$ and $\mathbf{b}_{k-1}$ are swapped; otherwise it carries out size reduction to satisfy (2). The algorithm is known to terminate in a finite number of iterations for any given input basis $\mathbf{B}$ and for $\delta < 1$ [20] (note that this is true even when $\delta = 1$ [38], [39]). By an iteration we mean the operations within the "while" loop in Algorithm 2, which correspond to an increment or decrement of the variable $k$.

Algorithm 1 Pairwise Size Reduction
Input: Basis vectors $\mathbf{b}_k$ and $\mathbf{b}_l$ ($l < k$)
       GSO coefficient matrix $[\mu_{i,j}]$
Output: Basis vector $\mathbf{b}_k$ size-reduced against $\mathbf{b}_l$
        Updated GSO coefficient matrix $[\mu_{i,j}]$

if $|\Re(\mu_{k,l})| > 1/2$ or $|\Im(\mu_{k,l})| > 1/2$ then
    $\mathbf{b}_k := \mathbf{b}_k - \lceil \mu_{k,l} \rfloor \mathbf{b}_l$
    for $j = 1, 2, \ldots, l$ do
        $\mu_{k,j} := \mu_{k,j} - \lceil \mu_{k,l} \rfloor \mu_{l,j}$

Algorithm 2 LLL Algorithm
Input: A basis $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_n]$
Output: The LLL-reduced basis

 1: compute GSO $\mathbf{B} = \hat{\mathbf{B}} [\mu_{i,j}]^T$
 2: $k := 2$
 3: while $k \le n$ do
 4:     size-reduce $\mathbf{b}_k$ against $\mathbf{b}_{k-1}$
 5:     if $\|\hat{\mathbf{b}}_k + \mu_{k,k-1} \hat{\mathbf{b}}_{k-1}\|^2 < \delta \|\hat{\mathbf{b}}_{k-1}\|^2$ then
 6:         swap $\mathbf{b}_k$ and $\mathbf{b}_{k-1}$ and update GSO
 7:         $k := \max(k - 1, 2)$
 8:     else
 9:         for $l = k - 2, k - 3, \ldots, 1$ do
10:             size-reduce $\mathbf{b}_k$ against $\mathbf{b}_l$
11:         $k := k + 1$

Obviously, for a real-valued basis matrix $\mathbf{B}$, Definition 1 and
Algorithm 2 coincide with the standard real LLL algorithm.
Some further relations between real and complex LLL are
discussed in the Appendix.
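The control flow of Algorithm 2 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, indices are 0-based (so `k = 1` corresponds to $k = 2$), and for clarity the GSO is recomputed from scratch instead of being updated incrementally, which costs more flops than the updates counted later in the paper.

```python
def inner(u, v):
    """Inner product u^H v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def norm2(v):
    """Squared Euclidean length."""
    return sum(abs(x) ** 2 for x in v)


def gso(basis):
    """Return GS vectors and coefficients mu (assumed projection convention)."""
    n = len(basis)
    gs, mu = [], [[0j] * n for _ in range(n)]
    for i, b in enumerate(basis):
        v = list(b)
        for j in range(i):
            mu[i][j] = inner(gs[j], b) / norm2(gs[j])
            v = [x - mu[i][j] * y for x, y in zip(v, gs[j])]
        gs.append(v)
    return gs, mu


def round_gauss(z):
    """Round real and imaginary parts separately to the nearest integers."""
    return complex(round(z.real), round(z.imag))


def size_reduce(b, k, l):
    """Pairwise size reduction of b[k] against b[l] (in place)."""
    _, mu = gso(b)
    c = round_gauss(mu[k][l])
    b[k] = [x - c * y for x, y in zip(b[k], b[l])]


def complex_lll(basis, delta=0.75):
    """Sketch of Algorithm 2; returns a reduced copy of the basis."""
    b = [[complex(x) for x in v] for v in basis]
    n = len(b)
    k = 1
    while k < n:
        size_reduce(b, k, k - 1)
        gs, mu = gso(b)
        # By orthogonality, ||gs_k + mu * gs_{k-1}||^2 splits into two terms.
        lhs = norm2(gs[k]) + abs(mu[k][k - 1]) ** 2 * norm2(gs[k - 1])
        if lhs < delta * norm2(gs[k - 1]):
            b[k], b[k - 1] = b[k - 1], b[k]  # Lovasz test failed: swap
            k = max(k - 1, 1)
        else:
            for l in range(k - 2, -1, -1):  # full size reduction
                size_reduce(b, k, l)
            k += 1
    return b
```

A call such as `complex_lll([[4, 1, 0], [1, 0, 0], [3 - 2j, 1 + 4j, 1]])` returns a basis of the same lattice that satisfies both (2) and (3) for $\delta = 3/4$.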

Remark 1: The LLL algorithm can also operate on the
Gram matrix [40]. To do this, one applies the Cholesky
decomposition and updates the Gram matrix. The rest of the
algorithm remains essentially the same.

C. LLL-Deep 

LLL-deep extends the swapping step to all vectors before $\mathbf{b}_k$, as shown in Algorithm 3. The standard LLL algorithm is restricted to $i = k - 1$ in Line 5. LLL-deep can find shorter vectors. However, there are no proven bounds for LLL-deep other than those for standard LLL. The experimental complexity of LLL-deep is a few times that of LLL, although the worst-case complexity is exponential. To limit the complexity, it is common to restrict the insertion within a window [21]. However, we will not consider this window in this paper.
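The deep-insertion test can be sketched as a small helper that scans candidate positions from the front of the basis. This is an illustrative sketch: the function name and argument layout are hypothetical, indices are 0-based, and the test is assumed to compare $\sum_{j=i}^{k} |\mu_{k,j}|^2 \|\hat{\mathbf{b}}_j\|^2$ (with $\mu_{k,k} = 1$) against $\delta \|\hat{\mathbf{b}}_i\|^2$.

```python
def deep_insertion_index(gs_norm2, mu_k, k, delta=0.75):
    """Return the smallest i < k where b_k should be inserted, or None.

    gs_norm2: squared lengths of the GS vectors.
    mu_k:     row k of the GSO coefficient matrix (mu_k[k] is treated as 1).
    """
    # Squared length of b_k projected orthogonally to b_0, ..., b_{i-1},
    # starting with i = 0 (i.e. the squared length of b_k itself).
    tail = gs_norm2[k] + sum(abs(mu_k[j]) ** 2 * gs_norm2[j] for j in range(k))
    for i in range(k):
        if tail < delta * gs_norm2[i]:
            return i
        # Drop the component along the i-th GS vector before testing i + 1.
        tail -= abs(mu_k[i]) ** 2 * gs_norm2[i]
    return None


# The last tested position i = k - 1 is the ordinary LLL swap test.
position = deep_insertion_index([1.0, 4.0, 0.5], [0.5, 0.5, 1.0], k=2)
```

When the helper returns `None`, no insertion is needed and the index $k$ advances, as in standard LLL.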

Algorithm 3 LLL Algorithm with Deep Insertion
Input: A basis $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_n]$
Output: The LLL-deep-reduced basis

compute GSO $\mathbf{B} = \hat{\mathbf{B}} [\mu_{i,j}]^T$
$k := 2$
while $k \le n$ do
    size-reduce $\mathbf{b}_k$ against $\mathbf{b}_{k-1}, \ldots, \mathbf{b}_2, \mathbf{b}_1$
    if $\exists\, i$, $1 \le i < k$, such that $\sum_{j=i}^{k} |\mu_{k,j}|^2 \|\hat{\mathbf{b}}_j\|^2 < \delta \|\hat{\mathbf{b}}_i\|^2$ then
        for the smallest such $i$, insert $\mathbf{b}_k$ before $\mathbf{b}_i$ and update GSO
        $k := \max(i, 2)$
    else
        $k := k + 1$

III. Complexity Analysis of Complex LLL

In the previous work [26], it was only qualitatively argued that complex LLL approximately reduces the complexity by half, while a rigorous analysis was lacking. In this section, we complement the work in [26] by evaluating the computational complexity in terms of (complex-valued) floating-point operations (flops). The other operations such as looping and swapping are ignored. The complexity analysis consists of two steps. Firstly, we bound the average number of iterations. Secondly, we bound the number of flops of a single iteration.

A. Average Number of Iterations 

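To analyze the number of iterations, we use a standard argument, where we consider the LLL potential [20]

$$\mathcal{D} = \prod_{i=1}^{n-1} \|\hat{\mathbf{b}}_i\|^{2(n-i)}. \tag{6}$$

Obviously, $\mathcal{D}$ only changes during the swapping step. This happens when

$$\|\hat{\mathbf{b}}_k\|^2 < (\delta - |\mu_{k,k-1}|^2) \|\hat{\mathbf{b}}_{k-1}\|^2 \tag{7}$$

for some $k$. After swapping, $\hat{\mathbf{b}}_{k-1}$ is replaced by $\hat{\mathbf{b}}_k + \mu_{k,k-1} \hat{\mathbf{b}}_{k-1}$. Thus $\|\hat{\mathbf{b}}_{k-1}\|^2$ as well as $\mathcal{D}$ shrinks by a factor less than $\delta$.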
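The shrinkage of the potential can be verified directly from (6). In the following short derivation, the primed symbols $\hat{\mathbf{b}}'_{k-1}$, $\hat{\mathbf{b}}'_k$ and $\mathcal{D}'$ denote the quantities after the swap; this primed notation is introduced here for illustration only. A swap changes only the two GS vectors with indices $k-1$ and $k$ and preserves the product of their squared lengths:

```latex
\|\hat{\mathbf{b}}'_{k-1}\|^2 \, \|\hat{\mathbf{b}}'_{k}\|^2
   = \|\hat{\mathbf{b}}_{k-1}\|^2 \, \|\hat{\mathbf{b}}_{k}\|^2
\quad\Longrightarrow\quad
\frac{\mathcal{D}'}{\mathcal{D}}
   = \frac{\|\hat{\mathbf{b}}'_{k-1}\|^{2(n-k+1)} \, \|\hat{\mathbf{b}}'_{k}\|^{2(n-k)}}
          {\|\hat{\mathbf{b}}_{k-1}\|^{2(n-k+1)} \, \|\hat{\mathbf{b}}_{k}\|^{2(n-k)}}
   = \frac{\|\hat{\mathbf{b}}'_{k-1}\|^2}{\|\hat{\mathbf{b}}_{k-1}\|^2}
   = \frac{\|\hat{\mathbf{b}}_{k}\|^2 + |\mu_{k,k-1}|^2 \|\hat{\mathbf{b}}_{k-1}\|^2}
          {\|\hat{\mathbf{b}}_{k-1}\|^2}
   < \delta ,
```

where the last inequality is exactly condition (7).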

The number $K$ of iterations is exactly the number of Lovasz tests. Let $K^+$ and $K^-$ be the numbers of positive and negative tests, respectively. Obviously, $K = K^+ + K^-$. Since $k$ is incremented in a positive test and decremented in a negative test, and since $k$ starts at 2 and ends at $n$, we must have $K^+ \le K^- + (n - 1)$ (see also [27]). Thus it is sufficient to bound $K^-$.

Let $A = \max_i \|\hat{\mathbf{b}}_i\|^2$ and $a = \min_i \|\hat{\mathbf{b}}_i\|^2$. The initial value of $\mathcal{D}$ can be bounded from above by $A^{n(n-1)/2}$. To bound the number of iterations for a complex-valued basis, we invoke the following lemma [20], [27], which holds for complex LLL as well.

Lemma 1: During the execution of the LLL algorithm, the maximum $A$ is non-increasing while the minimum $a$ is non-decreasing.

In other words, the LLL algorithm tends to reduce the interval $[a, A]$ where the squared lengths of GS vectors reside. From Lemma 1, we obtain

$$K^- \le \frac{n(n-1)}{2} \log \frac{A}{a} \tag{8}$$

where the logarithm is taken to the base $1/\delta$ (this will be the case throughout the paper).

Assuming that the basis vectors are i.i.d. in the unit ball of $\mathbb{R}^n$, Daude and Vallee showed that the mean of $K$ is upper-bounded by $O(n^2 \log n)$ [27]. The analysis for the i.i.d. Gaussian model is similar. Yet, here we use the exact value of $\mathcal{D}_0$, which is the initial value of $\mathcal{D}$, to bound the average number of iterations. It leads to a better bound than using the maximum $A$ in (8). The lower bound on $\mathcal{D}$ is as follows:

$$\mathcal{D}_{\text{lower}} = a^{n(n-1)/2} \le \mathcal{D}.$$

Accordingly, the mean of $K^-$ is bounded by

$$\begin{aligned}
E[K^-] &\le E\left[\log \frac{\mathcal{D}_0}{\mathcal{D}_{\text{lower}}}\right] \\
&= E[\log \mathcal{D}_0] - E[\log \mathcal{D}_{\text{lower}}] \\
&= E[\log \mathcal{D}_0] - \frac{n(n-1)}{2} E[\log a].
\end{aligned} \tag{9}$$

We shall bound the two terms separately.

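The QR decomposition of an i.i.d. complex normal random matrix has the following property: the squares of the diagonal elements $r_{i,i}$ of the matrix $\mathbf{R}$ are statistically independent $\chi^2$ random variables with $2(n - i + 1)$ degrees of freedom [41]. Since $r_{i,i}^2 = \|\hat{\mathbf{b}}_i\|^2$, we have

$$\begin{aligned}
E[\log \mathcal{D}_0] &= \sum_{i=1}^{n-1} (n - i) E\left[\log \|\hat{\mathbf{b}}_i\|^2\right] \\
&\le \sum_{i=1}^{n-1} (n - i) \log E\left[\|\hat{\mathbf{b}}_i\|^2\right] \\
&= \sum_{i=1}^{n-1} (n - i) \log 2(n - i + 1) \\
&\le \frac{n(n-1)}{2} \log 2n
\end{aligned} \tag{10}$$

where the first inequality follows from Jensen's inequality $E[\log X] \le \log E[X]$.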

It remains to determine $E[\log a]$. The cumulative distribution function (cdf) $F_a(x)$ of the minimum $a$ can be written as

$$F_a(x) = 1 - \prod_{i=1}^{n} [1 - F_i(x)]$$

where $F_i(x)$ denotes the cdf of a $\chi^2$ random variable with $2i$ degrees of freedom:

$$F_i(x) = 1 - e^{-x} \sum_{m=0}^{i-1} \frac{x^m}{m!}. \tag{11}$$



As $n$ tends to infinity, $F_a(x)$ approaches a limit cdf $\bar{F}_a(x)$ that is not a function of $n$. Since $a$ decreases with $n$, $E[\log a]$ is necessarily bounded from below by its limit:

$$E[\log a] \ge \int_0^\infty \log x \, d\bar{F}_a(x). \tag{12}$$

The convergence of $F_a(x)$ is demonstrated in Fig. 1.

Fig. 1. Convergence of $F_a(x)$ as $n$ increases.

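Although it is difficult to evaluate the above integral exactly, we can derive a lower bound. To do this, we examine the behavior of the limit probability density function (pdf) $\bar{f}_a(x)$. $\bar{f}_a(x)$ is a decreasing function. Moreover, $\bar{f}_a(0) = 1$. To show this, note that as $x \to 0^+$ we have the approximation $e^{-x} \approx 1 - x$ and $\sum_{m=0}^{i-1} x^m/m! \approx 1 + x$. Therefore, as $x \to 0^+$

$$\begin{aligned}
\bar{F}_a(x) &\approx 1 - \lim_{n \to \infty} (1 - x) \prod_{i=2}^{n} (1 - x)(1 + x) \\
&= 1 - \lim_{n \to \infty} (1 - x)(1 - x^2)^{n-1}
\end{aligned} \tag{13}$$

and accordingly $\bar{F}_a(x) \approx x$ as $x \to 0^+$. Then $\bar{f}_a(0) = d\bar{F}_a(x)/dx|_{x=0^+} = 1$. Then we have

$$\begin{aligned}
E[\log a] &\ge \int_0^\infty \log x \, \bar{f}_a(x) \, dx \\
&\ge \int_0^1 \log x \, \bar{f}_a(x) \, dx \\
&\ge \int_0^1 \log x \, dx = -1
\end{aligned} \tag{14}$$

where the last inequality follows from $\bar{f}_a(x) \le 1$ for $x > 0$.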
Substituting (10), (14) into (9), we obtain

$$E[K^-] \le \frac{n(n-1)}{2} (\log 2n + 1). \tag{15}$$

Therefore, we arrive at the following result:

Proposition 1: Under the i.i.d. complex normal model of the basis matrix $\mathbf{B}$, the total number of iterations of the LLL algorithm can be bounded as

$$E[K] \le n(n-1)(\log 2n + 1) + n \approx n^2 \log n. \tag{16}$$
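To get a feel for the size of the bound (16), it can be evaluated numerically. The following is a small illustrative helper (the function name is hypothetical); it assumes, as stated above, that the logarithm is taken to the base $1/\delta$.

```python
import math


def iteration_bound(n, delta=0.75):
    """Right-hand side of (16): n(n-1)(log 2n + 1) + n, log to base 1/delta."""
    log_2n = math.log(2 * n) / math.log(1 / delta)
    return n * (n - 1) * (log_2n + 1) + n


bounds = {n: iteration_bound(n) for n in (2, 4, 8, 16)}
```

The values in `bounds` grow roughly like $n^2 \log n$, in line with Proposition 1.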



Remark 2: A similar analysis in [31] applied the bounds $\sigma_1 \ge \sqrt{A}$ and $\sqrt{a} \ge \sigma_n$, where $\sigma_1$ and $\sigma_n$ are the maximum and minimum singular values of $\mathbf{B}$, respectively. Accordingly, the resultant bound $\log(\sigma_1/\sigma_n)$ is less tight. In fact, [31] showed the bound $E[K] \le 4n^2 \log n$, which is larger by a factor of 4.



B. Number of Flops for Each Iteration 

The second step proceeds as follows. Updating the GSO coefficients [20] during the swapping step costs $6(n - k) + 7 \le 6n - 5$ flops ($k \ge 2$), whereas pairwise size reduction for $(k, k - 1)$ costs $2n + 2(k - 1) \le 4n - 2$ flops ($k \ge 2$). Testing the Lovasz condition as in (4) costs 3 flops each time. Besides, the initial GSO costs $2n^3$ flops. Therefore, excluding full size reduction, the cost is bounded by (see footnote 2)

$$\begin{aligned}
C_1 &\le (6n - 5) K^- + (4n - 2 + 3)(K^- + K^+) + 2n^3 \\
&\le (6n - 5) \frac{n(n-1)}{2} \log 2n + (4n + 1)\left[n(n-1) \log 2n + (n - 1)\right] + 2n^3 \\
&= 7n^2(n - 1) \log 2n - \frac{3}{2} n(n-1) \log 2n + (4n + 1)(n - 1) + 2n^3 \\
&\le 7n^3 \log 2n + 2n^3.
\end{aligned} \tag{17}$$

During each step, the number of flops due to full size reduction is no more than

$$\sum_{l=1}^{k-2} (2n + 2l) \le 3n^2. \tag{18}$$

Therefore, the subtotal amount of flops due to full size reduction is

$$C_2 \le 3n^2 \left[\frac{n(n-1)}{2} \log 2n + (n - 1)\right] \tag{19}$$

which is $O(n^4 \log n)$ and thus is dominant. This results in an $O(n^4 \log n)$ bound on the overall complexity $C = C_1 + C_2$ of the complex LLL algorithm.
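The relative weight of the two cost terms can be illustrated numerically. The helper below is a sketch with a hypothetical name; it evaluates the final expression in (17) and the bound (19), taking the logarithm to the base $1/\delta$ and assuming the prefactor $3n^2$ from (18) multiplies the bracket in (19).

```python
import math


def flop_bounds(n, delta=0.75):
    """Return (c1, c2): bounds on flops without / due to full size reduction."""
    log_2n = math.log(2 * n) / math.log(1 / delta)
    c1 = 7 * n**3 * log_2n + 2 * n**3                        # last line of (17)
    c2 = 3 * n**2 * (n * (n - 1) / 2 * log_2n + (n - 1))     # bound (19)
    return c1, c2


ratios = {n: flop_bounds(n)[1] / flop_bounds(n)[0] for n in (4, 8, 16, 32)}
```

The ratio `c2 / c1` grows roughly linearly in $n$, which is why the full size reduction term dominates the bound for standard LLL.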

C. Comparison with Real-valued LLL 

In the conventional approach, the complex-valued matrix $\mathbf{B}$ is converted into a real-valued matrix $\mathbf{B}_R$ (see the Appendix). Obviously, $\mathbf{B}$ and $\mathbf{B}_R$ have the same values of $A$ and $a$, and the same convergence speed determined by $\delta$. Since the size of $\mathbf{B}_R$ is twice that of $\mathbf{B}$, real LLL needs about four times as many iterations as complex LLL.

For real-valued LLL, the expressions (17) and (19) are almost the same [28], with a doubled size $n$. Under the assumption that on average a complex arithmetic operation costs four times as much as a real operation, $C_1$ of complex LLL is half of that of real LLL. Meanwhile, when comparing $C_2$, there is a subtle difference, i.e., the chance of full size reduction (Lines 9 and 10 in Algorithm 2) is doubled for complex LLL [26]. Therefore, $C_2$ of complex LLL is also half. Then the total cost of complex LLL is approximately half.

D. Reduction of the Dual Basis 

Sometimes it might be preferable to reduce the dual
basis $\mathbf{B}^* = (\mathbf{B}^{-1})^H \mathbf{J}$, where $\mathbf{J}$ is the column-reversing matrix
[15]. In the following, we show that the $O(n^2 \log n)$ average
number of iterations still holds.

Footnote 2: The reason why we separately count $C_1$ and $C_2$ will become clear in the
next Section.
Let $\hat{\mathbf{B}}^*$, $A^*$, $a^*$ be the corresponding notations for the dual basis. Due to the relation $\|\hat{\mathbf{b}}_i\| = \|\hat{\mathbf{b}}^*_{n-i+1}\|^{-1}$ [42], we have

$$\frac{A}{a} = \frac{1/a^*}{1/A^*} = \frac{A^*}{a^*}.$$

Thus, the bound on the number of iterations is exactly the same. In particular, the bound on the average number of iterations is the same.

In MIMO broadcast, it is $\mathbf{B}^{-1}$ that needs to be reduced [43]. Noting that $\mathbf{B}^{-1}$ has the same statistics as $(\mathbf{B}^{-1})^H \mathbf{J}$ because $\mathbf{B}$ is i.i.d. normal, it is easy to see that the bound on $K$ is again the same.

Remark 3: The same conclusion was drawn in [31] by 
examining the singular values. 

IV. Effective LLL Reduction 

A. Effective LLL 

Since some applications such as sphere decoding and SIC 
only require the GS vectors rather than the basis vectors 
themselves, and since size reduction does not change them, 
a weaker version of the LLL algorithm is sufficient for such 
applications. This makes it possible to devise a variant of the 
LLL algorithm that has lower theoretic complexity than the 
standard one. 

From the above argument it seems that we would be able to remove the size reduction operations altogether. However, this is not the case. An inspection shows that the size-reduced condition for two consecutive basis vectors

$$|\Re(\mu_{i,i-1})| \le 1/2 \quad \text{and} \quad |\Im(\mu_{i,i-1})| \le 1/2, \quad 1 < i \le n \tag{20}$$

is essential in maintaining the lengths of the GS vectors. In other words, (20) must be kept along with the Lovasz condition so that the lengths of GS vectors will not be too short. Note that their lengths are related to the performance of SIC and the complexity of sphere decoding. We want the lengths to be as even as possible so as to improve the SIC performance and reduce the complexity of sphere decoding.

A basis that satisfies condition (20) and the Lovasz condition (3) is called an effectively LLL-reduced basis in [29]. Effective LLL reduction terminates in exactly the same number of iterations, because size-reducing against other vectors has no impact on the Lovasz test. In addition to (4), an effectively LLL-reduced basis has other nice properties. For example, if a basis is effectively LLL-reduced, so is its dual basis [24], [29].

Effective LLL reduction permits us to remove from Algorithm 2 the most expensive part, i.e., size-reducing $\mathbf{b}_k$ against $\mathbf{b}_{k-2}, \mathbf{b}_{k-3}, \ldots, \mathbf{b}_1$ (Lines 9-10). For integral bases, doing this may cause excessive growth of the (rational) GSO coefficients $\mu_{i,j}$, $j < i - 1$, and the increase of bit lengths will likely offset the computational saving. This is nonetheless not a problem in MIMO decoding, since the basis vectors and GSO coefficients can be represented by floating-point numbers after all. We use a model where floating-point operations take constant time, and numerical accuracy is assumed to be preserved. There is strong evidence that this model is practical, because the correctness of floating-point LLL for integer lattices has been proven [40]. Although the extension of the proof to the case of continuous bases seems very difficult, in practice this model is valid as long as the arithmetic precision is sufficient for the lattice dimensions under consideration.

We emphasize that under this condition the effective and standard LLL algorithms have the same error performance in the application to SIC and sphere decoding, as asserted by Proposition 2.

Proposition 2: The SIC and sphere decoder with effective 
LLL reduction finds exactly the same lattice point as that with 
standard LLL reduction. 

Proof: This is obvious since SIC and sphere decoding 
only need the GS vectors and since standard and effective LLL 
give exactly the same GS vectors. ■ 

B. Transformation to Standard LLL-Reduced Basis 

On the other hand, ZF does require the condition of full size reduction. One can easily transform an effectively LLL-reduced basis into a fully reduced one. To do so, we simply perform size reductions at the end to make the other coefficients satisfy $|\Re(\mu_{i,j})| \le 1/2$ and $|\Im(\mu_{i,j})| \le 1/2$, for $1 \le j < i - 1$, $2 < i \le n$ [24], [44]. This is because, once again, such operations have no impact on the Lovasz condition. Full size reduction costs $O(n^3)$ arithmetic operations. The analysis in the following subsection will show that the complexity of this version of LLL reduction is on the same order as that of effective LLL reduction. In other words, it has lower theoretic complexity than the standard LLL algorithm.
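The final full size reduction can be sketched as a post-processing pass over an effectively reduced basis. This is an illustrative sketch only: the function names are hypothetical, indices are 0-based, and the GSO is recomputed from scratch before each pairwise step for clarity rather than updated as in Algorithm 1.

```python
def inner(u, v):
    """Inner product u^H v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def norm2(v):
    """Squared Euclidean length."""
    return sum(abs(x) ** 2 for x in v)


def gso(basis):
    """Return GS vectors and coefficients mu (assumed projection convention)."""
    n = len(basis)
    gs, mu = [], [[0j] * n for _ in range(n)]
    for i, b in enumerate(basis):
        v = list(b)
        for j in range(i):
            mu[i][j] = inner(gs[j], b) / norm2(gs[j])
            v = [x - mu[i][j] * y for x, y in zip(v, gs[j])]
        gs.append(v)
    return gs, mu


def full_size_reduce(basis):
    """Size-reduce each b_k against b_{k-2}, ..., b_1 (1-based notation)."""
    b = [[complex(x) for x in v] for v in basis]
    for k in range(2, len(b)):            # k = 3, ..., n in 1-based indexing
        for l in range(k - 2, -1, -1):    # l = k-2 down to 1
            _, mu = gso(b)
            c = complex(round(mu[k][l].real), round(mu[k][l].imag))
            b[k] = [x - c * y for x, y in zip(b[k], b[l])]
    return b


effective = [[1, 0, 0], [0, 1, 0], [3 - 2j, 0.4, 1]]
reduced = full_size_reduce(effective)
```

In this example the coefficient $\mu_{3,2} = 0.4$ and the GS vectors are left untouched, while $\mu_{3,1}$ is brought within the bounds of (2).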

There are likely multiple bases of the lattice $L$ that are LLL-reduced. For example, a basis reduced in the sense of Korkin-Zolotarev (KZ) is also LLL-reduced. Proposition 3 shows that this version results in the same reduced basis as the LLL algorithm.

Proposition 3: Fully size-reducing an effectively LLL- 
reduced basis gives exactly the same basis as the standard 
LLL algorithm. 

Proof: It is sufficient to prove the GSO coefficient matrix $[\mu_{i,j}]$ is the same, since $\mathbf{B} = \hat{\mathbf{B}} [\mu_{i,j}]^T$ and since $\hat{\mathbf{B}}$ is not changed by size reduction. We prove it by induction. Suppose the new version has the same coefficients $\mu_{i,j}$, $j < i$, when $i = 2, \ldots, k - 1$. Note that this is obviously true when $i = 2$.

When $i = k$ and $l = k - 2$, the new version makes $|\Re(\mu_{k,k-2})| \le 1/2$ and $|\Im(\mu_{k,k-2})| \le 1/2$ at the end by subtracting its integral part so that

$$|\Re(\mu_{k,k-2} - \lceil \mu_{k,k-2} \rfloor)| \le 1/2, \quad |\Im(\mu_{k,k-2} - \lceil \mu_{k,k-2} \rfloor)| \le 1/2.$$

The standard LLL algorithm achieves this in a number of iterations. Yet the sum of integers subtracted must be equal. The other coefficients will be updated as

$$\mu_{k,j} := \mu_{k,j} - \lceil \mu_{k,k-2} \rfloor \mu_{k-2,j}, \quad j = 1, \ldots, k - 3,$$

which will remain the same since coefficients $\mu_{k-2,j}$ are assumed to be the same.

Clearly, the argument can be extended to the case $i = k$ and $l = k - 3, \ldots, 1$. That is, the new version also has the same coefficients $\mu_{k,j}$, $j < k$. This completes the proof. ■
C. $O(n^3 \log n)$ Complexity

Proposition 4: Under the i.i.d. complex normal model of the basis $\mathbf{B}$, the average complexity of effective LLL is bounded by $C_1$ in (17), which is $O(n^3 \log n)$.

Proof: Since the number of iterations is the same, and since each iteration of effective LLL costs $O(n)$ arithmetic operations, the total computation cost is $O(n^3 \log n)$. More precisely, the effective LLL consists of the following computations: initial GSO, updating the GSO coefficients during the swapping step, pairwise size reduction, and testing the Lovasz condition. Therefore, the total cost is exactly bounded by $C_1$ in (17). ■

To obtain a fully reduced basis, we further run pairwise size reduction for $l = k - 2$ down to 1 for each $k = 3, \ldots, n$. The additional number of flops required is bounded by

$$\begin{aligned}
\sum_{k=3}^{n} \sum_{l=1}^{k-2} (2n + 2l) &= \sum_{k=3}^{n} \left[2n(k - 2) + (k - 1)(k - 2)\right] \\
&= \frac{4}{3} n(n - 1)(n - 2) \le \frac{4}{3} n^3.
\end{aligned} \tag{21}$$

Obviously, the average complexity is still $O(n^3 \log n)$.
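The closed form in (21) is easy to check by brute force. The following sketch (the function name is hypothetical) evaluates the double sum directly with integer arithmetic so that it can be compared with $\frac{4}{3} n(n-1)(n-2)$.

```python
def extra_size_reduction_flops(n):
    """Double sum in (21): sum over k = 3..n and l = 1..k-2 of (2n + 2l)."""
    return sum(2 * n + 2 * l for k in range(3, n + 1) for l in range(1, k - 1))


# Compare 3 * sum with 4 * n * (n - 1) * (n - 2) to stay in exact integers.
checks = [3 * extra_size_reduction_flops(n) == 4 * n * (n - 1) * (n - 2)
          for n in range(3, 30)]
```

Every entry of `checks` is `True`, confirming that the extra cost of the final full size reduction is cubic in $n$.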

Again, since each complex arithmetic operation on average requires four real arithmetic operations, the net saving in complexity due to complex effective LLL is about 50%.

In Fig. 2, we show the theoretic upper bounds and experimental results for effective and standard LLL algorithms. Clearly there is room to improve the analysis. This is because our work is in fact a blend of worst- and average-case analysis, and the resultant theoretic bound is unlikely to be sharp. Nonetheless, the experimental data exhibit cubic growth with $n$, thereby supporting the $O(n^3 \log n)$ bound. On the other hand, surprisingly, the experimental complexity of standard LLL reduction is not much higher than that of effective LLL. We observed that this is because of the small probability of executing size reduction with respect to nonconsecutive vectors (Lines 9-10 of the standard LLL algorithm), which were thought to dominate the complexity.

V. Fixed-Complexity Implementation 

Although the average complexity of the LLL algorithm is polynomial, it is important to recognize that its complexity is in fact variable due to its sequential nature. The worst-case complexity could be very large [31], which may severely limit the throughput of the decoder. In this Section, we propose two fixed-complexity structures that are suitable for approximately implementing the LLL algorithm in hardware. One is fixed-complexity effective LLL, while the other is fixed-complexity LLL-deep. The structures are based on parallel versions of the LLL and LLL-deep algorithms, respectively. The parallel versions exhibit fast convergence, which allows us to fix the number of iterations without incurring much quality degradation.

A. Fixed-Complexity Effective LLL 

Recently, a fixed-complexity LLL algorithm was proposed
in [33]. It resembles the parallel 'even-odd' LLL algorithm
earlier proposed in [45] (see also [46] for a systolic-array
implementation) and a similar algorithm in [30]. It is well
known that LLL reduction can be achieved by performing
size reduction and swapping in any order. Therefore, the idea
in [33] was to run a super-iteration where the index $k$ is
monotonically incremented from 2 to $n$, and to repeat this
super-iteration until the basis is reduced. This is slightly different
from the 'even-odd' LLL [45], where the super-iteration is
performed for even and odd indexes $k$ separately.

Fig. 2. Average number of complex flops for effective CLLL reduction with
$\delta = 3/4$ for the i.i.d. normal basis model. (Curves: standard CLLL, upper
bound; effective CLLL, upper bound; standard CLLL, experimental; effective
CLLL, experimental.)

Here, we extend this fixed-complexity structure to effective
LLL, which is a truncated version of the parallel effective
LLL described in Algorithm 4 (one can easily imagine an
'even-odd' version of this algorithm). The index $k$ is never
reduced in a super-iteration. Of course, one can further run
full size reduction to make the basis reduced in the sense
of LLL. It is easy to see that this algorithm converges by
examining the LLL potential function. To cater for
fixed-complexity implementation, we run a sufficiently large but
fixed number of super-iterations.
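The structure described above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration under stated assumptions, not the implementation behind the results in this paper: the names `gso`, `cround`, and `fixed_complexity_effective_lll` are hypothetical, and the GSO is recomputed by a QR factorization at every step for clarity, whereas an efficient implementation would update it in place.

```python
import numpy as np


def gso(B):
    """Squared Gram-Schmidt norms d and coefficients mu of the columns of B.

    Convention: b_k = sum_j mu[k, j] * bhat_j, with mu[k, k] = 1.
    """
    R = np.linalg.qr(B, mode="r")
    diag = np.diag(R)
    d = np.abs(diag) ** 2
    mu = (R / diag[:, None]).T
    return d, mu


def cround(z):
    """Round the real and imaginary parts separately to the nearest integer."""
    return np.round(z.real) + 1j * np.round(z.imag)


def fixed_complexity_effective_lll(B, delta=0.75, num_super_iters=None):
    """Run super-iterations of parallel effective LLL on the columns of B.

    With num_super_iters=None the loop runs until a super-iteration
    performs no swap; otherwise it stops after the given number.
    """
    B = np.array(B, dtype=complex)
    n = B.shape[1]
    done = 0
    while num_super_iters is None or done < num_super_iters:
        swapped = False
        for k in range(1, n):                # k = 2, ..., n in 1-based indexing
            d, mu = gso(B)                   # recomputed for clarity, not speed
            q = cround(mu[k, k - 1])
            B[:, k] -= q * B[:, k - 1]       # size-reduce b_k against b_{k-1} only
            m = mu[k, k - 1] - q             # coefficient after size reduction
            # Swap when the Lovasz condition fails for the pair (k-1, k).
            if d[k] + abs(m) ** 2 * d[k - 1] < delta * d[k - 1]:
                B[:, [k - 1, k]] = B[:, [k, k - 1]]
                swapped = True
        done += 1
        if not swapped:
            break
    return B, done
```

Passing a finite `num_super_iters` gives the truncated, fixed-complexity variant; the index `k` only moves forward within a super-iteration, as in Algorithm 4.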

How many super-iterations should we run? A crude estimate
is $O(n^2 \log n)$, i.e., the same as that for standard LLL. Since
there is at least one swap within a super-iteration (otherwise
the algorithm terminates), the number of super-iterations is
bounded by $O(n^2 \log n)$ (this is the approach used in [33]).
However, this approach might be pessimistic, as up to $n - 1$
swaps may occur in one super-iteration. Next, we shall argue
that on average it is sufficient to run $O(n \log n)$
super-iterations in order to obtain a good basis. Accordingly, since
each super-iteration costs $O(n^2)$, the overall complexity is
$O(n^3 \log n)$.

Proposition 5: Let $c = \frac{1}{\delta^2(\delta - 1/2)}$ for complex LLL and
$c = \frac{1}{\delta^2(\delta - 1/4)}$ for real LLL. On average, the fixed-complexity
effective LLL finds a short vector with length

$$\|\mathbf{b}_1\| \le \frac{c^{(n-1)/4}}{\sqrt{\delta}} \det{}^{1/n} L \tag{22}$$

after $O(n \log n)$ super-iterations.

Proof: The proof is an extension of the analysis of 'even-odd'
LLL in [47], with two modifications.
Following [47], define the ratios

$$v(i) = \frac{\prod_{j=1}^{i} \|\hat{\mathbf{b}}_j\|^2}{c^{i(n-i)/2} (\det L)^{2i/n}}. \tag{23}$$

Let $v_{\max} = \max\{v(i), 1 \le i \le n\}$. Following [47], one can
prove that if $v_{\max} > 1/\delta$ and $v(i) > \delta v_{\max}$, then the swapping
condition is satisfied, i.e., $\|\hat{\mathbf{b}}_{i+1}\|^2 < (\delta - |\mu_{i+1,i}|^2)\|\hat{\mathbf{b}}_i\|^2$.
Thus, after swapping, any such $v(i)$ will be decreased by a
factor less than $\delta$; all other ratios do not increase. Hence $v_{\max}$
will be decreased by a factor less than $\delta$.

The first modification is bounding $v_{\max}$ in the beginning.
We can see that in the beginning $v_{\max} \le A^n / \det^2 L \le (A/a)^n$
(recall $A = \max_i \|\hat{\mathbf{b}}_i\|^2$ and $a = \min_i \|\hat{\mathbf{b}}_i\|^2$).^3

Secondly, we need to check whether the swapping condition
$\|\hat{\mathbf{b}}_{i+1}\|^2 < (\delta - |\mu_{i+1,i}|^2)\|\hat{\mathbf{b}}_i\|^2$ remains satisfied after $k$ goes
from 2 to $i$. It turns out to be true. This is because $\|\hat{\mathbf{b}}_i\|^2$
will not decrease; in fact, the new $\|\hat{\mathbf{b}}_i\|^2$ increases by a factor
larger than $1/\delta$ if $\mathbf{b}_{i-1}$ and $\mathbf{b}_i$ are swapped. Thus, $\mathbf{b}_i$ and
$\mathbf{b}_{i+1}$ will always be swapped regardless of the swap between
$\mathbf{b}_{i-1}$ and $\mathbf{b}_i$, and accordingly $v(i)$ will be decreased.

Thus, one has $v_{\max} \le 1/\delta$ after $O(n \log A - 2 \log(\det L)) \le
O(n \log(A/a))$ iterations. On average this is $O(n \log n)$.

In particular, $v(1) \le 1/\delta$ implies that $\mathbf{b}_1$ is short, with
length bounded in (22). ■

Remark 4: This analysis also applies to the fixed-complexity
LLL proposed in [33] (this is obvious since size
reduction does not change the GSO).

Remark 5: Compared to (5) for standard sequential LLL,
the approximation factor in (22) is larger by a coefficient
$(1/\delta)^{n/2}$. Yet we can make this factor very small by choosing
$\delta \approx 1$. For example, if $\delta = 0.99$, then $(1/\delta)^{n/2} < 1.5$ for $n$ up
to 80.

Remark 6: In practice, one can obtain a good basis after $n$
super-iterations. This will be confirmed by simulation later in
this section.

Remark 7: Although we have only bounded the length of
$\mathbf{b}_1$, this suffices for some applications in lattice decoding. For
example, the embedding technique (a.k.a. augmented lattice
reduction [19]) only requires a bound for the shortest vector.

^3 The bound $D_{\max} \le A^n$ used in [47] does not necessarily hold here since
$\det L$ may be less than 1 for real (complex) valued bases.

Algorithm 4 Parallel Effective LLL Algorithm
Input: A basis $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_n]$
Output: Effectively LLL-reduced basis
  compute GSO $\mathbf{B} = \hat{\mathbf{B}} [\mu_{i,j}]^T$
  while any swap is possible do
    for $k = 2, 3, \ldots, n$ do
      size-reduce $\mathbf{b}_k$ against $\mathbf{b}_{k-1}$
      if $\|\hat{\mathbf{b}}_k + \mu_{k,k-1} \hat{\mathbf{b}}_{k-1}\|^2 < \delta \|\hat{\mathbf{b}}_{k-1}\|^2$ then
        swap $\mathbf{b}_k$ and $\mathbf{b}_{k-1}$ and update GSO

Algorithm 5 Sorted GSO
Input: A basis $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_n]$
Output: GSO for the sorted basis
1: let $\hat{\mathbf{B}} = \mathbf{B}$
2: for $i = 1, 2, \ldots, n$ do
3:   $k = \arg\min_{i \le m \le n} \|\hat{\mathbf{b}}_m\|$
4:   exchange the $i$-th and $k$-th columns of $\hat{\mathbf{B}}$
5:   for $j = i + 1, \ldots, n$ do
6:     compute the coefficient $\mu_{j,i} = \langle \hat{\mathbf{b}}_j, \hat{\mathbf{b}}_i \rangle / \|\hat{\mathbf{b}}_i\|^2$
7:     update $\hat{\mathbf{b}}_j := \hat{\mathbf{b}}_j - \mu_{j,i} \hat{\mathbf{b}}_i$
8:   %% joint sorted GSO and size reduction %%
9:   % $\mathbf{b}_i := \mathbf{b}_i - \cdots$

1) Relation to PMLLL: Another variant of the LLL algorithm,
PMLLL, was proposed in [30], which repeats two steps:
one is a series of swaps to satisfy the Lovász condition
(even ignoring $|\mu_{k,k-1}| \le 1/2$), the other is size
reduction to make $|\mu_{k,k-1}| \le 1/2$ for $k = 2, \ldots, n$. This
variant is similar to parallel effective LLL. However, $k$ does
not necessarily scan from 2 to $n$ monotonically in PMLLL.



B. Fixed-Complexity LLL-Deep 

Here, we propose another fixed-complexity structure to
approximately implement LLL-deep (and, accordingly, LLL).
More precisely, we apply sorted GSO and size reduction
alternately. This structure is closely related to V-BLAST.

The sorted GSO relies on the modified GSO [37]. At each
step, the remaining columns of $\mathbf{B}$ are projected onto the
orthogonal complement of the linear space spanned by the
GS vectors already obtained. In sorted GSO, one picks the
shortest GS vector at each step, which corresponds to the
sorted QR decomposition proposed by Wübben et al. [48]^4 (we
will use the terms sorted GSO and sorted QR decomposition
interchangeably). Algorithm 5 describes the process of sorted
GSO.

For $i = 1, \ldots, n$, let $\pi_i$ denote the projection onto the
orthogonal complement of the subspace spanned by the vectors
$\mathbf{b}_1, \ldots, \mathbf{b}_{i-1}$. Then the sorted GSO has the following property:
$\mathbf{b}_1$ is the shortest vector among $\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n$; $\pi_2(\mathbf{b}_2)$ is the
shortest among $\pi_2(\mathbf{b}_2), \ldots, \pi_2(\mathbf{b}_n)$; and so on.^5

Sorted GSO tends to reduce $\max\{\|\hat{\mathbf{b}}_1\|, \|\hat{\mathbf{b}}_2\|, \ldots, \|\hat{\mathbf{b}}_n\|\}$.
In fact, using proof by contradiction, we can show that
sorted GSO minimizes $\max\{\|\hat{\mathbf{b}}_1\|, \|\hat{\mathbf{b}}_2\|, \ldots, \|\hat{\mathbf{b}}_n\|\}$. This is
in contrast (but also very similar in another sense) to V-BLAST,
which maximizes $\min\{\|\hat{\mathbf{b}}_1\|, \|\hat{\mathbf{b}}_2\|, \ldots, \|\hat{\mathbf{b}}_n\|\}$ [32].
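The greedy selection in sorted GSO can be illustrated with a short NumPy sketch. It is an illustration under stated assumptions rather than the authors' code: the function name `sorted_gso` and its return convention (column order, GS vectors, coefficients) are hypothetical, and the coefficient convention is $\mathbf{b}_k = \sum_j \mu_{k,j} \hat{\mathbf{b}}_j$.

```python
import numpy as np


def sorted_gso(B):
    """Greedy sorted Gram-Schmidt on the columns of B.

    Returns (order, Bhat, mu) such that B[:, order] = Bhat @ mu.T, where the
    columns of Bhat are orthogonal and mu is unit lower triangular.
    """
    B = np.array(B, dtype=complex)
    n = B.shape[1]
    Bhat = B.copy()                      # residuals, projected step by step
    order = np.arange(n)
    mu = np.eye(n, dtype=complex)
    for i in range(n):
        # Pick the shortest remaining projected vector.
        k = i + int(np.argmin(np.linalg.norm(Bhat[:, i:], axis=0)))
        Bhat[:, [i, k]] = Bhat[:, [k, i]]
        order[[i, k]] = order[[k, i]]
        # Coefficients already computed stay attached to their vectors.
        mu[[i, k], :i] = mu[[k, i], :i]
        # Modified Gram-Schmidt: project the remaining columns.
        for j in range(i + 1, n):
            mu[j, i] = np.vdot(Bhat[:, i], Bhat[:, j]) / np.vdot(Bhat[:, i], Bhat[:, i])
            Bhat[:, j] -= mu[j, i] * Bhat[:, i]
    return order, Bhat, mu
```

By construction, the $i$-th GS vector is the shortest among the projections of all vectors not yet selected at step $i$, which is the property stated above.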

Following sorted GSO, it is natural to define the following
notion of lattice reduction:

$$|\Re(\mu_{i,j})| \le 1/2, \quad |\Im(\mu_{i,j})| \le 1/2, \quad \text{for } 1 \le j < i \le n;$$
$$\|\pi_i(\mathbf{b}_i)\| = \min\{\|\pi_i(\mathbf{b}_i)\|, \|\pi_i(\mathbf{b}_{i+1})\|, \ldots, \|\pi_i(\mathbf{b}_n)\|\}, \quad \text{for } 1 \le i < n. \tag{24}$$

^4 Note that this is contrary to the well-known pivoting strategy where the
longest Gram-Schmidt vector is picked at each step so that $\|\hat{\mathbf{b}}_1\| \ge \|\hat{\mathbf{b}}_2\| \ge
\cdots \ge \|\hat{\mathbf{b}}_n\|$ [37].

^5 However, it is worth pointing out that sorted Gram-Schmidt
orthogonalization does not guarantee $\|\hat{\mathbf{b}}_1\| \le \|\hat{\mathbf{b}}_2\| \le \cdots \le \|\hat{\mathbf{b}}_n\|$. It is only a greedy
algorithm that hopefully makes the first few Gram-Schmidt vectors not too
long, and accordingly, the last few not too short. The term "sorted" is probably
imprecise, because the Gram-Schmidt vectors are not sorted in length at all.
In words, the basis is size-reduced and sorted in the sense of
sorted GSO. Such a basis obviously exists, as a KZ-reduced
basis satisfies the above two conditions [42]. In fact, KZ
reduction searches for the shortest vector in the lattice with
basis $[\pi_i(\mathbf{b}_i), \ldots, \pi_i(\mathbf{b}_n)]$.

The sorting condition is in fact stronger than the Lovász
condition. Since $\hat{\mathbf{b}}_i + \mu_{i,i-1} \hat{\mathbf{b}}_{i-1} = \pi_{i-1}(\mathbf{b}_i)$ and $\hat{\mathbf{b}}_{i-1} =
\pi_{i-1}(\mathbf{b}_{i-1})$, the Lovász condition with $\delta = 1$ turns out to be

$$\|\pi_{i-1}(\mathbf{b}_{i-1})\| \le \|\pi_{i-1}(\mathbf{b}_i)\|, \quad \text{for } 1 < i \le n, \tag{25}$$

which can be rewritten as

$$\|\pi_i(\mathbf{b}_i)\| = \min\{\|\pi_i(\mathbf{b}_i)\|, \|\pi_i(\mathbf{b}_{i+1})\|\}, \quad 1 \le i < n. \tag{26}$$

Obviously, this is weaker than the sorting condition for all
Gram-Schmidt vectors. Therefore, the reduction notion defined
above is stronger than LLL reduction even with $\delta = 1$, but is
weaker than KZ reduction.

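Meanwhile, when LLL-deep terminates, the following condition
is satisfied:

$$\delta \|\hat{\mathbf{b}}_i\|^2 \le \sum_{j=i}^{k} |\mu_{k,j}|^2 \|\hat{\mathbf{b}}_j\|^2 \quad \text{for } i < k \le n, \tag{27}$$

where $\mu_{k,k} = 1$, so that the right-hand side equals $\|\pi_i(\mathbf{b}_k)\|^2$.
If $\delta = 1$, this is equivalent to $\|\pi_i(\mathbf{b}_i)\| =
\min\{\|\pi_i(\mathbf{b}_i)\|, \|\pi_i(\mathbf{b}_{i+1})\|, \ldots, \|\pi_i(\mathbf{b}_n)\|\}$. Therefore, we
have

Proposition 6: The notion of lattice reduction defined in
(24) is equivalent to LLL-deep with $\delta = 1$ and unbounded
window size.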

Although this notion is the same as LLL-deep, it offers an
alternative implementation, shown in Algorithm 6, which
iterates between sorting and size reduction. Again, we refer
to sorting and size reduction as a super-iteration, since each
sorting is equivalent to many swaps. Obviously, the sorted
GSO preceding the main loop is not mandatory; we show
the algorithm in this way for convenience of comparison with
the DOLLAR detector [35] later on. Size reduction does not
change the GSO, but it shortens the vectors. Thus, after size
reduction the basis vector $\mathbf{b}_1$ may not be the shortest vector
any more, and this may also happen to other basis vectors.
Then, the basis vectors are sorted again. The iteration
continues until a reduced basis is obtained.

Obviously, Algorithm 6 finds an LLL-deep basis if it
terminates. It is not difficult to see that Algorithm 6 indeed
terminates, by using an argument similar to that for standard
LLL with $\delta = 1$ [38], [39]. The argument is that the order of
the vectors changes only when a shorter vector is inserted, but
the number of vectors shorter than a given length in a lattice
is finite. Therefore, the iteration cannot continue forever. We
conjecture that it also converges in $O(n \log n)$ super-iterations; in
practice it seems to converge in $O(n)$ super-iterations. Fig. 3
shows a typical outcome of numerical experiments on the LLL
potential function against the number of super-iterations.
It is seen that parallel LLL-deep could decrease the potential
more than LLL with $\delta = 1$, and the most significant decrease
occurs during the first few super-iterations.

Algorithm 6 Parallel LLL-Deep
Input: A basis $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_n]$
Output: The LLL-deep-reduced basis
1: sorted GSO of the basis
2: while there is any update do
3:   size reduction of the basis
4:   sorted GSO of the basis

Fig. 3. An example of the potential function against the number of iterations
for standard LLL with $\delta = 1$ and parallel LLL-deep.
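The alternation between size reduction and sorting in Algorithm 6 can be sketched as follows, truncated to a fixed number of super-iterations. This is a simplified illustration under stated assumptions: the helper names (`sorted_order`, `size_reduce`, `parallel_lll_deep`) are hypothetical, sorting is done by explicit projection, and the $\mathbf{R}$ factor is recomputed in every super-iteration rather than updated jointly with the sorting.

```python
import numpy as np


def sorted_order(B):
    """Column order chosen by sorted GSO (shortest projected vector first)."""
    W = np.array(B, dtype=complex)
    n = W.shape[1]
    order = list(range(n))
    for i in range(n):
        k = i + int(np.argmin(np.linalg.norm(W[:, i:], axis=0)))
        W[:, [i, k]] = W[:, [k, i]]
        order[i], order[k] = order[k], order[i]
        q = W[:, i] / np.linalg.norm(W[:, i])
        # Project the remaining columns onto the orthogonal complement of q.
        W[:, i + 1:] -= np.outer(q, np.conj(q) @ W[:, i + 1:])
    return order


def size_reduce(B):
    """Full size reduction: |Re mu| <= 1/2 and |Im mu| <= 1/2 for all pairs."""
    B = np.array(B, dtype=complex)
    n = B.shape[1]
    R = np.linalg.qr(B, mode="r")
    for k in range(1, n):
        for j in range(k - 1, -1, -1):
            m = R[j, k] / R[j, j]
            q = np.round(m.real) + 1j * np.round(m.imag)
            B[:, k] -= q * B[:, j]
            R[:, k] -= q * R[:, j]       # keep R consistent with B
    return B


def parallel_lll_deep(B, num_super_iters):
    """Fixed number of super-iterations of size reduction plus sorted GSO."""
    B = np.array(B, dtype=complex)
    B = B[:, sorted_order(B)]            # Line 1: initial sorted GSO
    for _ in range(num_super_iters):
        B = size_reduce(B)               # Line 3
        order = sorted_order(B)          # Line 4
        if order == list(range(B.shape[1])):
            break                        # size-reduced and already sorted
        B = B[:, order]
    return B
```

With `num_super_iters=1` the sketch performs one sorting, one size reduction, and one further sorting; larger values trade complexity for basis quality.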

As shown in Fig. 4(a), the proposed parallel LLL-deep
has the advantage of a regular, modular structure, and further
allows for pipeline implementation. Further, it is possible to
run sorted GSO and size reduction simultaneously in
Algorithm 6, as shown in Fig. 4(b). To do so, we just add Line
9 in sorted GSO (Algorithm 5), which will lead to the same
reduced basis. It will cost approximately $n^3$ flops. Thus, the
computational complexity of each super-iteration is roughly
$3n^3$, 50% higher than that of sorted GSO. Since sorted GSO
and size reduction are computed simultaneously, the latency
will be reduced. Further, both sorted GSO and size reduction
themselves can be parallelized [47]. We can see that while the
overall complexity might be $O(n^4 \log n)$, the throughput and
latency of Fig. 4(b) in a pipeline structure are similar to those
of V-BLAST.

1) Using Sorted Cholesky Decomposition: The complexity
of the proposed parallel LLL-deep in Fig. 4 mostly comes
from repeated GSO. To reduce the complexity, we can
replace it by sorted Cholesky decomposition [49]. This will
yield the same reduced basis, but has lower computational
complexity. The complexity of sorted Cholesky decomposition
is approximately $n^3/3$ (the initial multiplication $\mathbf{A} = \mathbf{B}^H \mathbf{B}$
costs approximately $n^3$ due to symmetry), while that of sorted
QR decomposition is approximately $2n^3$.

Since the sorted Cholesky decomposition was given for the
dual basis in [49], for the sake of clarity we redescribe it in
Algorithm 7. Here, $c_{m,n}$ is the $(m, n)$-th entry of $\mathbf{C}$, while
$\bar{c}_{m,n}$ denotes its complex conjugate. For convenience, we also
use the MATLAB notation $\mathbf{c}_{i:j,k}$ to denote a vector containing
those elements of $\mathbf{C}$. When Algorithm 7 terminates, the lower
triangular part of $\mathbf{C}$ is the Hermitian transpose of the $\mathbf{R}$
factor of the QR decomposition for the basis $\mathbf{B}$. Similarly,
one can run sorted Cholesky decomposition and size reduction
simultaneously.

Fig. 4. Fixed-complexity implementation of the LLL algorithm. (a) Separate
sorted GSO and size reduction; (b) joint sorted GSO and size reduction.

Algorithm 7 Sorted Cholesky Decomposition
Input: Gram matrix $\mathbf{A} = \mathbf{B}^H \mathbf{B}$
Output: $\mathbf{R}$ factor for the sorted basis
1: let $\mathbf{C} = \mathbf{A}$
2: for $i = 1, 2, \ldots, n$ do
3:   $k = \arg\min_{i \le m \le n} c_{m,m}$
4:   exchange the $i$-th and $k$-th columns and rows of $\mathbf{C}$
5:   ...
6:   $\mathbf{c}_{i+1:n,i} := \ldots$
7:   for $j = i + 1, \ldots, n$ do
8:     ...
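For illustration, a textbook Cholesky factorization with smallest-diagonal pivoting can be written as below. This is a sketch of the general technique, not a transcription of Algorithm 7: the function name `sorted_cholesky` is hypothetical, and, as a simplifying assumption, the whole trailing block is updated at each step (rather than only its lower triangle) so that the symmetric row and column exchanges remain valid.

```python
import numpy as np


def sorted_cholesky(A):
    """Cholesky factorization with smallest-diagonal pivoting.

    Returns (perm, L) such that A[np.ix_(perm, perm)] = L @ L^H, where L is
    lower triangular with a positive real diagonal.
    """
    C = np.array(A, dtype=complex)
    n = C.shape[0]
    perm = np.arange(n)
    for i in range(n):
        # The smallest remaining diagonal entry is the squared length of
        # the shortest projected basis vector.
        k = i + int(np.argmin(C.diagonal()[i:].real))
        C[[i, k], :] = C[[k, i], :]
        C[:, [i, k]] = C[:, [k, i]]
        perm[[i, k]] = perm[[k, i]]
        C[i, i] = np.sqrt(C[i, i].real)
        C[i + 1:, i] /= C[i, i]
        col = C[i + 1:, i]
        # Rank-one update of the trailing block (Schur complement).
        C[i + 1:, i + 1:] -= np.outer(col, np.conj(col))
    return perm, np.tril(C)
```

The returned factor satisfies $\mathbf{L}^H = \mathbf{R}$ up to the phases of the rows of $\mathbf{R}$, so the diagonal of $\mathbf{L}$ gives the lengths of the Gram-Schmidt vectors of the sorted basis.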

2) Relation to V-BLAST and DOLLAR Detector: The V-BLAST
ordering is a well-known technique to improve the
performance of communication by pre-sorting the columns of a
matrix [32]. It maximizes the length of the shortest
Gram-Schmidt vector of $\mathbf{B}$ among all $n!$ possible orders. V-BLAST
ordering starts from the last column of $\mathbf{B}$; it successively
chooses the $i$-th Gram-Schmidt vector with the maximum
length for $i = n, n-1, \ldots, 1$. Several $O(n^3)$ algorithms
have been proposed. One of them is obtained by applying
sorted GSO to the dual lattice [49]. This results in significant
computational savings because only a single GSO process is
needed.
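The max-min property of the V-BLAST ordering can be checked numerically on a tiny example. The sketch below is a deliberately naive illustration: it recomputes a QR factorization for every candidate, so it is far more expensive than the $O(n^3)$ algorithms mentioned above, and the function names are hypothetical.

```python
import itertools

import numpy as np


def gs_norms(B, order):
    """Lengths of the Gram-Schmidt vectors of the columns of B in the given order."""
    R = np.linalg.qr(B[:, list(order)], mode="r")
    return np.abs(np.diag(R))


def vblast_order(B):
    """Greedy V-BLAST ordering: fill positions n, n-1, ..., 1."""
    B = np.asarray(B, dtype=complex)
    remaining = list(range(B.shape[1]))
    order = []
    while remaining:
        def last_gs_norm(c):
            # GS length of column c when it is placed after all other remaining columns.
            others = [r for r in remaining if r != c]
            return gs_norms(B, others + [c])[-1]
        best = max(remaining, key=last_gs_norm)
        order.insert(0, best)
        remaining.remove(best)
    return order


def best_min_by_search(B):
    """Largest achievable minimum GS length, by exhaustive search (tiny n only)."""
    n = B.shape[1]
    return max(gs_norms(B, p).min() for p in itertools.permutations(range(n)))
```

For a random $4 \times 4$ matrix, `gs_norms(B, vblast_order(B)).min()` should coincide with `best_min_by_search(B)`, in line with the optimality of the greedy ordering.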



Now it is clear that the first iteration of parallel LLL-deep
is very similar to the DOLLAR detector in [35], which
comprises a sorted QR decomposition, a size reduction, and
a V-BLAST sorting. Since V-BLAST sorting is very close to
sorted QR decomposition, replacing V-BLAST ordering with
sorted QR decomposition does not make much difference. In
view of this, the DOLLAR detector can be seen as the
first-order approximation of parallel LLL-deep, which explains its
good performance. It can be seen from Fig. 3 that the first
iteration appears to decrease the potential more than any other
iteration. Of course, using just one iteration also limits the
performance of the DOLLAR detector.

3) Relation to Standard LLL and Some Variants: In fact, the
inventors of the LLL algorithm already suggested successively
making $\|\hat{\mathbf{b}}_1\|, \|\hat{\mathbf{b}}_2\|, \ldots, \|\hat{\mathbf{b}}_n\|$ as small as possible [50],
since they observed that shorter vectors among $\hat{\mathbf{b}}_1, \hat{\mathbf{b}}_2, \ldots, \hat{\mathbf{b}}_n$
typically appear at the end of the sequence. This idea is similar
to sorted QR decomposition, but in the LLL algorithm it is
implemented incrementally by means of swapping (i.e., in a
bubble-sort fashion).

Joint sorting and reduction [51] is also similar to LLL-deep.
It is well known that ordering can be used as a preprocessing
step to speed up lattice reduction. A more natural approach
is joint sorting and reduction [51], which uses modified GSO:
whenever a new vector is picked, it is the one with the
minimum norm (projected onto the orthogonal complement of
the basis vectors already reduced). That is, it runs sorted GSO
only once.

Recently, Nguyen and Stehlé proposed a greedy algorithm
for low-dimensional lattice reduction [52]. It computes a
Minkowski-reduced basis up to dimension 4. Their algorithm
is recursive; in each recursion, the basis vectors are ordered
by increasing lengths. Obviously, following their idea, we
can define another notion of lattice reduction where the
basis vectors are sorted in increasing lengths and also
size-reduced. The implementation of this algorithm resembles that
of parallel LLL-deep. That is, such a basis can be obtained by
alternately sorting by length and performing size reduction. Using the
same argument as that of LLL-deep, one can show that this
algorithm will also terminate after a finite number of
super-iterations. However, it seems difficult to prove any bounds for
this algorithm.

4) Reducing the Complexity of Sequential LLL-Deep:
While the primary goal is to allow parallel pipeline hardware
implementation, the proposed LLL-deep algorithm also has
a computational advantage over the conventional LLL-deep
algorithm even on a sequential computer for the first few
iterations. We observed that running parallel LLL-deep crudely
will not improve the speed. While parallel LLL-deep is quite
effective at the early stage, continuing to run it becomes wasteful
at the late stage, as the quality of the basis has already improved a
lot. In fact, updating occurs less frequently at the late stage;
thus the standard serial version of LLL-deep will be faster.
As a result, a hybrid strategy using parallel LLL-deep at the
early stage and then switching to the serial version at the late
stage will be more efficient. Parallel LLL-deep can be viewed
as a preprocessing stage for such a hybrid strategy. One can
run numerical experiments to determine the best
time to switch from parallel to sequential LLL-deep.

In Fig. 5, we show the average running time of LLL-deep
for the complex MIMO lattice, on a notebook computer with a
Pentium Dual CPU working at 1.8 GHz. Sorted GSO is used.
It is seen that the hybrid strategy can improve the speed by a
factor of up to 3 for $n \le 50$.

Fig. 5. Average running time for standard and hybrid LLL-deep for the
complex MIMO lattice. The number of iterations in parallel LLL-deep is set
to 2.

C. Bit Error Rate (BER) Performance 

We evaluate the impact of a finite number of super-iterations
on the performance of parallel LLL and LLL-deep. In the
following simulations of the BER performance, we use MMSE-based
lattice decoding and set $\delta = 1$ in the complex LLL
algorithm for the best performance.

Fig. 6 shows the performance of parallel effective LLL
for different numbers of super-iterations for an 8 × 8 MIMO
system with 64-QAM modulation and SIC detection. The
performance of ML detection is also shown as a benchmark for
comparison. It is seen that increasing the number of iterations
improves the BER performance; in particular, with 8 iterations,
parallel effective LLL almost achieves the same performance
as standard LLL. On the other hand, the SNR gain diminishes
as the number of iterations increases beyond the first few.

Fig. 7 shows the performance of parallel LLL-deep for
an 8 × 8 MIMO system with 64-QAM modulation and SIC
detection. A similar trend is observed. Note that parallel
LLL-deep with only one super-iteration, which essentially
corresponds to the DOLLAR detector in [35], does not achieve
full diversity, despite its good performance. Compared to Fig.
6, we can see that the performance is very similar in the end,
but parallel LLL-deep seems to perform better in the first few
super-iterations.

VI. Concluding Remarks 

We have derived the $O(n^4 \log n)$ average complexity of the
LLL algorithm in MIMO communication. We also proposed
the use of effective LLL, which enjoys $O(n^3 \log n)$ theoretic
average complexity. Although in practice effective LLL does not
significantly reduce the complexity, the $O(n^3 \log n)$ theoretic
bound improves our understanding of the complexity of LLL.
To address the issue of variable complexity, we have proposed
two parallel versions of the LLL algorithm that allow
truncation after some super-iterations. Such truncation led to a
fixed-complexity approximation of the LLL algorithm. The first such
algorithm was based on effective LLL, and we argued that it is
sufficient to run $O(n \log n)$ super-iterations. The second was a
parallel version of LLL-deep, which overcomes the limitation
of the DOLLAR detector. We also showed that V-BLAST is
a relative of LLL.

Fig. 6. Performance of parallel effective LLL (PELLL) for an 8 × 8 MIMO
system with 64-QAM and SIC detection.

Fig. 7. Performance of parallel LLL-deep for an 8 × 8 MIMO system with
64-QAM and SIC detection. (Curves: parallel LLL-deep with 1, 2, 4, and 8
super-iterations; CLLL; ML. Horizontal axis: $E_b/N_0$ (dB).)

Finally, we point out some open questions. 

A precise bound on the complexity of LLL remains to be
found. The $O(n^4 \log n)$ bound seems to be loose in MIMO
communication.

Using effective LLL reduction may raise the concern of
numerical stability, as the GS coefficients will grow. Although,
with the accuracy present in most floating-point implementations,
this does not seem to cause a problem for the practical
ranges of $n$ in MIMO communications, a rigorous study of
this issue (e.g., following [40]) is a topic of future research.

Although for parallel effective LLL we proved that the
first basis vector is short after $O(n \log n)$ super-iterations,
it remains to show in theory whether full diversity can be
achieved or not. It is also an open question whether similar
results exist for parallel LLL-deep.

Appendix I 

Relations Between Real and Complex LLL 

A. Reducedness 

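It is common in the literature to convert the complex basis
matrix $\mathbf{B}$ into a real matrix and then apply the standard real
LLL algorithm. We shall analyze the relationship between this
approach and complex LLL. There are many ways to convert
$\mathbf{B}$ into a real matrix. One of them is to convert each element
of $\mathbf{B}$ locally, i.e.,

$$\mathbf{B}_R = \begin{bmatrix} \Re(b_{1,1}) & -\Im(b_{1,1}) & \cdots \\ \Im(b_{1,1}) & \Re(b_{1,1}) & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}. \tag{28}$$

For convenience, we write this conversion as $\mathbf{B}_R = \mathcal{F}(\mathbf{B})$.
Obviously, $\mathbf{B}_R$ is $2n$-dimensional if $\mathbf{B}$ is $n$-dimensional.
Another conversion is

$$\mathbf{B}'_R = \begin{bmatrix} \Re(\mathbf{B}) & -\Im(\mathbf{B}) \\ \Im(\mathbf{B}) & \Re(\mathbf{B}) \end{bmatrix}. \tag{29}$$

Correspondingly, we write this as $\mathbf{B}'_R = \mathcal{F}'(\mathbf{B})$.

Note that when the real-valued LLL algorithm is applied to
$\mathbf{B}_R$ or $\mathbf{B}'_R$, the structures as above are generally not preserved.

Now suppose the complex matrix $\mathbf{B}$ has been complex LLL-reduced.
We then expand the reduced $\mathbf{B}$ as in (28) or (29).
What can be said about $\mathbf{B}_R$ and $\mathbf{B}'_R$?

Lemma 2: Let $\mathbf{B} = \hat{\mathbf{B}} \boldsymbol{\mu}^T$ and $\mathbf{B}_R = \hat{\mathbf{B}}_R \boldsymbol{\mu}_R^T$ be the
Gram-Schmidt orthogonalization of the complex matrix $\mathbf{B}$ and
its real counterpart $\mathbf{B}_R$, respectively. Then we have $\hat{\mathbf{B}}_R =
\mathcal{F}(\hat{\mathbf{B}})$ and $\boldsymbol{\mu}_R = \mathcal{F}(\boldsymbol{\mu})$.

The proof is omitted. Lemma 2 says that the structure
in (28) is preserved under Gram-Schmidt orthogonalization.
Using this, we now prove the following result.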

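Proposition 7: If $\mathbf{B}$ is reduced in the sense of complex LLL
with parameter $\delta$ ($1/2 < \delta \le 1$), then $\mathbf{B}_R$ is reduced in the
sense of real LLL with parameter $\delta - 1/4$.

Proof: First, note that $\|\hat{\mathbf{b}}_{R,2i-1}\| = \|\hat{\mathbf{b}}_{R,2i}\| = \|\hat{\mathbf{b}}_i\|$, and
$\boldsymbol{\mu}_R$ looks like

$$\boldsymbol{\mu}_R = \begin{bmatrix} 1 & & & & \\ 0 & 1 & & & \\ \Re(\mu_{2,1}) & -\Im(\mu_{2,1}) & 1 & & \\ \Im(\mu_{2,1}) & \Re(\mu_{2,1}) & 0 & 1 & \\ \vdots & \vdots & & & \ddots \end{bmatrix}. \tag{30}$$

Obviously, the Lovász condition is satisfied by the $(2i-1)$-th
and $(2i)$-th column vectors of $\mathbf{B}_R$ because $\|\hat{\mathbf{b}}_{R,2i}\|^2 =
\|\hat{\mathbf{b}}_{R,2i-1}\|^2 \ge (\delta - |\mu_{R,2i,2i-1}|^2) \|\hat{\mathbf{b}}_{R,2i-1}\|^2$. Let us examine the
$(2i)$-th and $(2i+1)$-th column vectors. From the Lovász condition
for the complex basis, we have

$$\begin{aligned} \|\hat{\mathbf{b}}_{R,2i+1}\|^2 &\ge (\delta - |\Re(\mu_{i+1,i})|^2 - |\Im(\mu_{i+1,i})|^2) \|\hat{\mathbf{b}}_{R,2i}\|^2 \\ &\ge (\delta - 1/4 - |\Im(\mu_{i+1,i})|^2) \|\hat{\mathbf{b}}_{R,2i}\|^2 \\ &= (\delta - 1/4 - |\mu_{R,2i+1,2i}|^2) \|\hat{\mathbf{b}}_{R,2i}\|^2. \end{aligned} \tag{31}$$

This proves the Proposition. ■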
Remark 8: It appears that little can be said about $\mathbf{B}'_R$.
By Proposition 2, even if a 2-dimensional complex lattice $\mathbf{B}$
is Gauss-reduced (which is arguably the strongest reduction),
its real equivalent $\mathbf{B}_R$ is not necessarily LLL-reduced with
parameter $\delta = 1$. It is easy to construct such a lattice:

$$\mathbf{B} = \begin{bmatrix} 1 & 1/2 + j/2 \\ 0 & \sqrt{2}/2 \end{bmatrix}.$$

B. Approximation factors 

Let us compare the approximation factors for the first
vectors $\mathbf{b}_1$ and $\mathbf{b}_{R,1}$ in the reduced bases. Since $\det \mathbf{B} =
\det^{1/2} \mathbf{B}_R$ and the shortest nonzero vectors have the same length
in the real and complex bases, we only need to compare $\alpha^{n-1}$
with $\beta^{2n-1}$. Recall that $\alpha = 1/(\delta - 1/2)$ and $\beta = 1/(\delta - 1/4)$.
It is easy to check that $\alpha \ge \beta^2$, because

$$(\delta - 1/4)^2 \ge \delta - 1/2, \tag{32}$$

where the equality holds if and only if $\delta = 3/4$; indeed,
$(\delta - 1/4)^2 - (\delta - 1/2) = (\delta - 3/4)^2$. Therefore,
asymptotically, complex LLL has a larger approximation factor
unless $\delta = 3/4$. In fact, a minor difference between the
error performances of real and complex LLL can be observed
when $n$ becomes large (e.g., $n > 10$), although they are
indistinguishable at small dimensions [26].